Capability
Focuses on whether a model is capable of performing a task
- Advantages
- Fairly simple
- Enough for now
- Disadvantages
- Susceptible to deceptive models
- May not be enough in the future
- Easy for us to lock ourselves in a particular eval framework that becomes mainstream -- but is insufficient
Alignment
Focuses on the propensity of a model to use dangerous capabilities.
- Disadvantages
- Generally difficult to implement since misalignment/failure modes for alignment are elusive
- Note/s
- Does not necessarily mean mechanistic interpretability
- Understanding-based Evaluations that are currently insufficient, but may be starting points:
- Causal Scrubbing
- A principled approach to evaluating the quality of mechanistic interpretations
- A systematic Ablation method for testing precisely stated hypotheses about how a particular neural network implements a behavior on a dataset.
- Auditing Games
- technique for evaluating interpretability tools, not a technique for evaluating the extent to which we understand a model
- Prediction-based Evaluation
- Causal Scrubbing
Â